course: (ML Series) Logistic Regression

By Julien Hernandez Lallement, 2020-10-15, in category Course

machine learning, python

Logistic Regression

General Background

Generally speaking, regression methods are supervised learning techniques. They use continuous scaled variables (independent variables) to predict the behavior of a dependent variable. They can use different equations that fit straight lines (linear regression), polynomial functions (to detect interaction effects) or other functions to predict the dependent variable.

In this post, I will be focusing on logistic regression, which is typically used as a classification algorithm. The logistic regression is however not a classification algorithm per se, but can be used as such when a probability threshold is set to differentiate between classes.

Use case

Logistic regressions can be used in different scenarios:

  • Classifying data based on continuous scaled features (when the dependent variable is categorical, typically binary, i.e., 0 and 1)
  • Classifying whether an email is spam or not

Theoretical Background

Let's first explain why the most common use of Logistic Regression is classification. Again, to be clear, Logistic Regression is not a classification algorithm per se. It can be used for classification if a threshold is set, above and below which the vector of features will categorize the data point as belonging to one class or the other.

Look at the graph below:

In [51]:
import numpy as np
import matplotlib.pyplot as plt

# positive / negative hold the example points of each class (their definition is not shown here)
plt.scatter(positive[1], positive[0]);
plt.scatter(negative[1], negative[0]);

sigmoid = lambda x: 1 / (1 + np.exp(-x))
x = np.linspace(-10, 10, 100)
plt.plot(x, sigmoid(x), 'b', label='linspace(-10,10,100)')  # logistic curve
plt.plot(np.linspace(-7.5, 6, 100), np.linspace(0, 1, 100), 'g')  # illustrative "linear fit"
plt.grid()
plt.ylabel('Class')
Out[51]:
Text(0, 0.5, 'Class')

Here, the y axis shows the class of the data points (spam or not spam), and the x axis shows a feature of these data points (say, the number of spelling mistakes in the email's subject line).

As you can imagine, a linear fit (green line; not an actual fit, just an illustration) would be quite poor in this case. If a new data point comes in with y=1 but a high value of x, your linear fit will show a tremendous increase in residuals, and fit the data poorly.

Moreover, if the y axis does not reflect classes but probabilities of belonging to a class, you will have issues because your linear fit might predict values >1 or <0, which would be nonsense.

One way to circumvent this problem is to use a sigmoidal function, such as the logistic function:

$ h_\theta(x) = \frac{1}{1 + e^{-\theta_0 - \theta_1 x}} $

This function is shown in the figure above as the blue line. Using it, a new data point would be classified as 1 or 0 based on a probability threshold on $h_\theta$.

Let's have a closer look at it:

In [55]:
import math
x_values = np.linspace(-5, 5, 100)
y_values = [1 / (1 + math.e**(-x)) for x in x_values]  # logistic function
plt.plot(x_values, y_values)
plt.axhline(.5)  # probability of 0.5 ...
plt.axvline(0)   # ... reached at x = 0
Out[55]:
<matplotlib.lines.Line2D at 0x7f7a34539210>

You can see why this is a great function for a probability measure. The y-value represents the probability and ranges only between 0 and 1. Also, an x value of zero gives a probability of .5; more positive x values give a higher probability, and more negative x values a lower probability.

The aforementioned threshold should be decided based on business knowledge and common sense. A threshold of 0.5 would be quite flexible, allowing some "smart" spam (if that exists) to land in the inbox. In turn, we might lower the threshold so that hardly any spam lands in the inbox, while taking the risk of getting false positives (good mails that land in the spam folder). Trial and error is a good technique to monitor the threshold and its efficiency.
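
To make the thresholding concrete, here is a minimal sketch (the probability values and variable names are made up for illustration) of how lowering the threshold flags more mails as spam, at the price of potential false positives:

import numpy as np

# Hypothetical predicted spam probabilities for a batch of emails
probs = np.array([0.10, 0.45, 0.55, 0.80, 0.97])

lenient = (probs >= 0.5).astype(int)  # default threshold: some "smart" spam gets through
strict = (probs >= 0.3).astype(int)   # lower threshold: catches more spam, risks false positives

print(lenient)  # [0 0 1 1 1]
print(strict)   # [0 1 1 1 1]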

Once we have our features, we need to use them in our logistic function.

For example:

$$x = \beta_{0} + \beta_{1}Feat1 + \beta_{2}Feat2 $$

Where Feat1 is our value for Feature 1 and Feat2 is our value for Feature 2. For those of you familiar with Linear Regression this looks very familiar. Basically we are assuming that x is a linear combination of our data plus an intercept.

For example, take the iris dataset. Say we have a plant with a sepal width of 3.5 and a sepal length of 5. Say we already know the parameters $\beta$ such that $\beta_{0} = 1$, $\beta_{1} = 2$, and $\beta_{2} = 4$. This would imply:

$$x = 1 + (2 * 3.5) + (4 * 5) = 28$$

Plugging this into our logistic function gives:

$$\frac{1}{1 + e^{-28}} = .99$$

So we would give a 99% probability to a plant with those dimensions as being Setosa.
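
As a quick sanity check of the arithmetic above, here is a minimal sketch of that computation (the $\beta$ values are the made-up ones from the example):

import numpy as np

beta = np.array([1, 2, 4])        # beta_0 (intercept), beta_1, beta_2 from the example
features = np.array([1, 3.5, 5])  # 1 for the intercept, sepal width, sepal length

x = beta @ features               # 1 + (2 * 3.5) + (4 * 5) = 28
prob = 1 / (1 + np.exp(-x))       # logistic function

print(x, prob)                    # 28.0, ~0.99999...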

While we entered static values of $\beta$ here, these parameters are the ones that the machine should learn, and that is what Logistic Regression will do for you.

Cost function

While for Linear Regression & variants the cost function was typically the Residual Sum of Squares, things become a bit more tricky with Logistic Regression. Indeed, while the cost function of Linear Regression has a single minimum, simply plugging the logistic function into a sum of squares yields a non-convex cost that can have several local minima.

One typically used cost function is the following, which does have a single minimum:

$ J(\theta) = -\frac{1}{N} \sum\limits_{n=1}^{N} \left[ y^n \ln(h_\theta(x^n)) + (1 - y^n) \ln(1 - h_\theta(x^n)) \right] $

This function might seem quite complex at first, but when scrutinizing it, some might already see an analogy with Maximum Likelihood Estimation using probabilities (see my post on Introduction to Machine Learning). And indeed, there is a clear link, since minimizing this cost function amounts to maximizing the likelihood of the data belonging to their class.

In other words, one can see that when y = 1, the cost $J(\theta)$ decreases monotonically towards 0 as $h_\theta(x)$ approaches 1 (and grows very large as $h_\theta(x)$ approaches 0). Similarly, when y = 0, the cost is 0 when $h_\theta(x)$ equals 0.
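
To make this behavior tangible, here is a minimal numpy sketch (function and variable names are mine, for illustration) that evaluates this cost for a vector of predictions and labels:

import numpy as np

def logistic_cost(y_true, y_pred, eps=1e-15):
    # Average cross-entropy cost; eps avoids taking log(0)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])      # h_theta(x) for each sample

print(logistic_cost(y_true, y_pred))         # low cost: predictions match the labels
print(logistic_cost(y_true, 1 - y_pred))     # much higher cost: predictions inverted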

Practical Demonstration

World of Warcraft dataset

We will use Logistic Regression to predict which gamers will churn. Check this repo for the full data source, or get a small version of the data from my own repo, in the datasets folder. Check as well this nice video for a full explanation of the data.

To be honest about the code, the data pipe below was developed during a course I attended at the Netherlands Institute for Neuroscience in 2019. I modified a few bits here and there to add some data features, but the general logic was defined by the group during the course.

As you will see, most of the functions are preparatory functions.

One important one is the add_churn_label function, which is where the status of churner is defined. This is important, since it is the pattern that the algorithm will try to predict in unseen data.

The definition is a classic churn definition: users who played in a certain time window but did not log into the game in a later time window.

Note that I focus on the implementation of the algorithm, so as usual in this series of posts, I won't be going into data exploration and complex feature engineering. I leave you the fun of exploring that dataset and increasing the accuracy of the model.

In [2]:
import os
os.chdir('/home/julien/website/datasets')
In [27]:
import pandas as pd
import numpy as np
import datetime as dt
from datetime import timedelta
import plotnine as p9
In [30]:
df = pd.read_csv("wowah_data_small.csv")
In [53]:
def start_pipeline(dataf):
    return dataf.copy()

def fix_columns(dataf):
    dataf.columns = [c.lower().replace(" ","") for c in dataf.columns]
    return dataf

def fix_types(dataf):
    return dataf.assign(timestamp=lambda d: 
                        pd.to_datetime(d['timestamp'], 
                                       format="%m/%d/%y %H:%M:%S"))

def add_ts_features(dataf):
    return (dataf
           .assign(hour = lambda d: d['timestamp'].dt.hour)
           .assign(day_of_week = lambda d: d['timestamp'].dt.dayofweek))

def add_month_played(dataf):
    return (dataf
           .assign(month_played = lambda d: d['timestamp'].dt.month_name()))

def following_timestamp(dataf):
    # shift() pulls in the previous row's timestamp and char
    # (the frame is first sorted by char and timestamp)
    return (dataf
           .sort_values(by=['char','timestamp'])
           .assign(following_timestamp = lambda d: (d['timestamp'].shift()))
           .assign(following_char = lambda d: (d['char'].shift()))
           )

def played_time(dataf):
    return (dataf
           .assign(played_time = lambda d: (d['timestamp'] - d['following_timestamp'])))

# A player is labeled as churned if they were active (more than min_rows entries)
# in before_period and did not appear at all in after_period
def add_churn_label(dataf, before_period=("2008-01-01", "2008-03-01"),
                    after_period=("2008-04-01", "2008-06-01"), min_rows=10):
    before_df = (dataf
                 .loc[lambda d: d['timestamp'] >= pd.to_datetime(before_period[0])]
                 .loc[lambda d: d['timestamp'] < pd.to_datetime(before_period[1])])
 
    after_df = (dataf
                 .loc[lambda d: d['timestamp'] >= pd.to_datetime(after_period[0])]
                 .loc[lambda d: d['timestamp'] < pd.to_datetime(after_period[1])])
 
    before_chars = (before_df
     .groupby("char")
     .count()
     .loc[lambda d: d['level'] > min_rows]
     .reset_index()['char'])
 
    after_chars = (after_df
     .groupby("char")
     .count()
     .reset_index()['char'])
 
    return (before_df
     .loc[lambda d: d['char'].isin(before_chars)]
     .assign(churned = lambda d: d['char'].isin(after_chars) == False))

df_clean = (df
 .pipe(start_pipeline)
 .pipe(fix_columns)
 .pipe(fix_types)
 .pipe(add_ts_features)
 .pipe(add_month_played)
 .pipe(following_timestamp)     
 .pipe(played_time) 
 .pipe(add_churn_label)       
)

The feature char identifies the user in the dataframe; each row is one play entry. Let's group by it and compute a few relevant data features to feed the model:

In [60]:
ml_df = (df_clean
       .groupby(['char'])
       .apply(lambda d: pd.Series({
           "time_played": d.shape[0],
           "churned": d['churned'].any(),
           "mean_level": d['level'].mean(),
           "var_level": d['level'].var(),
           "guild_bool": d['guild'].any(),
           "max_level": d['level'].max().astype(float),
           "min_level": d['level'].min().astype(float)
       }))
        .assign(level_speed = lambda d: (d['max_level'] - d['min_level']) / d['time_played'])
         )

OK! We are ready to fit a model to predict churn. This post is about Logistic Regression, but I will use another classification algorithm, the KNeighbors Classifier, as a comparison.

In [61]:
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV

from sklego.preprocessing import ColumnSelector
In [ ]:
# Pre Processing Pipeline
panda_grabber = Pipeline([
    ("union", FeatureUnion([
        ("continous", Pipeline([
            ("select", ColumnSelector(["max_level", "min_level"])),
            ("scale", StandardScaler())
        ])),
        ("discrete", Pipeline([
            ("select", ColumnSelector(["guild_bool"])),
            ("encode", OneHotEncoder(sparse=False))
        ]))
    ]))
])

# Main Pipeline
pipeline = Pipeline([
    ("grab", panda_grabber),
    ("poly", PolynomialFeatures(interaction_only=True)),
    #("scale", StandardScaler()), 
    ("model", KNeighborsClassifier(10, weights='distance'))
])

Below, I define the different models to be tested.

Note that I will be using LBFGS as the optimization method. See here for more information. For gradient descent optimization, see the post on the Linear Regression method.

In [ ]:
# Define polynomial
param_poly = [PolynomialFeatures(interaction_only=True), None]

# Define models to test
param_model = [KNeighborsClassifier(1), 
               KNeighborsClassifier(10), 
               LogisticRegression(solver='lbfgs')]

# Define Grid Search
mod = GridSearchCV(pipeline,
                   #iid=True,
                   return_train_score=True,
                   param_grid={"model": param_model, # This overwrites the pipeline declaration of the model
                               "poly": param_poly},
                   cv=10)

Let's fit the different models:

In [63]:
mod.fit(ml_df,ml_df['churned']);
In [64]:
# Look at the cross-validation results
pd.DataFrame(mod.cv_results_).T
Out[64]:
0 1 2 3 4 5
mean_fit_time 0.00785432 0.00617559 0.0076494 0.00654972 0.0267492 0.0154052
std_fit_time 0.0011228 0.000319191 0.000846511 0.000612706 0.00277864 0.00545565
mean_score_time 0.0110934 0.00921001 0.0115895 0.0104084 0.00566826 0.00330086
std_score_time 0.00215004 0.000501717 0.00153521 0.000850591 0.000125533 0.00119705
param_model KNeighborsClassifier(algorithm='auto', leaf_si... KNeighborsClassifier(algorithm='auto', leaf_si... KNeighborsClassifier(algorithm='auto', leaf_si... KNeighborsClassifier(algorithm='auto', leaf_si... LogisticRegression(C=1.0, class_weight=None, d... LogisticRegression(C=1.0, class_weight=None, d...
param_poly PolynomialFeatures(degree=2, include_bias=True... None PolynomialFeatures(degree=2, include_bias=True... None PolynomialFeatures(degree=2, include_bias=True... None
params {'model': KNeighborsClassifier(algorithm='auto... {'model': KNeighborsClassifier(algorithm='auto... {'model': KNeighborsClassifier(algorithm='auto... {'model': KNeighborsClassifier(algorithm='auto... {'model': LogisticRegression(C=1.0, class_weig... {'model': LogisticRegression(C=1.0, class_weig...
split0_test_score 0.746094 0.742188 0.753906 0.75 0.746094 0.746094
split1_test_score 0.726562 0.722656 0.746094 0.75 0.746094 0.746094
split2_test_score 0.761719 0.753906 0.753906 0.757812 0.75 0.746094
split3_test_score 0.765625 0.195312 0.773438 0.773438 0.765625 0.761719
split4_test_score 0.707031 0.734375 0.75 0.734375 0.75 0.75
split5_test_score 0.695312 0.722656 0.75 0.75 0.742188 0.75
split6_test_score 0.710938 0.703125 0.746094 0.753906 0.742188 0.746094
split7_test_score 0.707031 0.671875 0.742188 0.753906 0.777344 0.773438
split8_test_score 0.617188 0.574219 0.714844 0.707031 0.714844 0.804688
split9_test_score 0.47451 0.45098 0.560784 0.52549 0.407843 0.407843
mean_test_score 0.691201 0.627129 0.729125 0.725596 0.714222 0.723206
std_test_score 0.0825324 0.16945 0.0577515 0.068688 0.103272 0.106598
rank_test_score 5 6 1 2 4 3
split0_train_score 0.836735 0.835432 0.772471 0.773339 0.761615 0.759444
split1_train_score 0.841511 0.841511 0.778116 0.777681 0.758576 0.759878
split2_train_score 0.847156 0.828485 0.776379 0.774642 0.761615 0.75901
split3_train_score 0.845419 0.50977 0.7703 0.768997 0.759878 0.760313
split4_train_score 0.838472 0.815458 0.774642 0.775944 0.75901 0.758142
split5_train_score 0.829353 0.8363 0.775076 0.77551 0.760747 0.75901
split6_train_score 0.846287 0.837169 0.775076 0.771168 0.759878 0.758576
split7_train_score 0.825011 0.818498 0.773339 0.7703 0.758142 0.756839
split8_train_score 0.827616 0.818063 0.771168 0.770734 0.750326 0.751628
split9_train_score 0.825521 0.820312 0.78125 0.78125 0.768663 0.766927
mean_train_score 0.836308 0.7961 0.774782 0.773957 0.759845 0.758977
std_train_score 0.00836732 0.0958596 0.00310267 0.00361133 0.00425055 0.00354195
In [65]:
pd.DataFrame(mod.cv_results_)[['param_model', 'mean_train_score', 'mean_test_score','std_test_score']]
Out[65]:
param_model mean_train_score mean_test_score std_test_score
0 KNeighborsClassifier(algorithm='auto', leaf_si... 0.836308 0.691201 0.082532
1 KNeighborsClassifier(algorithm='auto', leaf_si... 0.796100 0.627129 0.169450
2 KNeighborsClassifier(algorithm='auto', leaf_si... 0.774782 0.729125 0.057752
3 KNeighborsClassifier(algorithm='auto', leaf_si... 0.773957 0.725596 0.068688
4 LogisticRegression(C=1.0, class_weight=None, d... 0.759845 0.714222 0.103272
5 LogisticRegression(C=1.0, class_weight=None, d... 0.758977 0.723206 0.106598

As we can see, the Logistic Regression reaches a decent score on the test set. It is not really better than the KNN classifier, but in my opinion it is much more intuitive to explain to non-technical stakeholders, and therefore offers better explainability.
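
For instance, one way to support that point is to look directly at the fitted coefficients. The snippet below is only a hedged sketch (the feature choice and the standalone refit are mine, not part of the pipeline above): it refits a logistic regression on two of the engineered features and prints the coefficients, which can be read as changes in the log-odds of churning per standard deviation of each feature.

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = ml_df[["max_level", "min_level"]]
y = ml_df["churned"]

logreg = make_pipeline(StandardScaler(), LogisticRegression(solver="lbfgs"))
logreg.fit(X, y)

# One coefficient per (scaled) feature: sign and magnitude are easy to discuss with stakeholders
print(dict(zip(X.columns, logreg.named_steps["logisticregression"].coef_[0])))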

Final words

That was it for Logistic Regression. Hope you enjoyed it, and drop me a mail if you have any comments, questions or (constructive) criticism.